Counterfactual Risk Minimization
Authors
Abstract
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. Unlike in supervised learning, where the algorithm receives training examples $(x_i, y_i^*)$ with annotated correct labels $y_i^*$, bandit feedback merely provides a cardinal reward $\delta_i \in \mathbb{R}$ for the prediction $y_i$ that the logging system made for context $x_i$. Such bandit feedback is ubiquitous in online systems (e.g., observing a click $\delta_i$ on ad $y_i$ for query $x_i$), while “correct” labels (e.g., the best ad $y_i^*$ for query $x_i$) are difficult to assess.

Our work builds upon recent approaches to the off-policy evaluation problem [5], [8], [7], [9], where data collected from the interaction logs of one bandit algorithm is re-used to evaluate another system. These approaches use counterfactual reasoning [3] to derive an unbiased estimate of the system’s performance. Our work centers on the insight that, for robust learning, it is not sufficient to have just an unbiased estimate of system performance. We must also reason about how the variances of these estimators differ across the hypothesis space, and pick the hypothesis with the tightest conservative bound on system performance. We first derive generalization error bounds analogous to structural risk minimization [15] for a stochastic hypothesis family. The constructive nature of these bounds suggests a general principle, Counterfactual Risk Minimization (CRM), for designing methods for batch learning from bandit feedback. Using the CRM principle, we derive a new learning algorithm, Policy Optimizer for Exponential Models (POEM), for structured output prediction. We evaluate POEM on several multi-label classification problems and verify that its empirical performance supports the theory.

Existing approaches for batch learning from logged bandit feedback fall into two categories. The first reduces the problem to supervised learning, using techniques like cost-weighted classification [16] or the Offset Tree algorithm [2] when the space of possible predictions is small. In contrast, our approach handles structured output prediction with exponential-sized prediction spaces.
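To make the CRM idea concrete, here is a minimal sketch of a variance-regularized counterfactual objective. It is an illustration under stated assumptions, not the paper's released implementation: the function name `crm_objective`, the default clipping threshold `clip_m`, and the penalty strength `lam` are hypothetical, and the inputs are assumed to be NumPy arrays holding, for each logged example, the candidate policy's propensity $h(y_i \mid x_i)$, the logging propensity $\pi_0(y_i \mid x_i)$, and the logged loss $\delta_i$.

```python
import numpy as np

def crm_objective(new_prop, logged_prop, losses, clip_m=10.0, lam=0.5):
    """CRM-style sketch: clipped importance-sampling risk plus a variance penalty.

    new_prop    -- h(y_i | x_i): candidate policy's propensity for the logged action
    logged_prop -- pi_0(y_i | x_i): logging policy's propensity (must be > 0)
    losses      -- delta_i: logged loss (e.g., negative reward) for each example
    clip_m      -- importance-weight clipping threshold (illustrative value)
    lam         -- empirical-variance penalty strength (illustrative value)
    """
    n = len(losses)
    # Clipping the importance weights keeps rarely-logged actions from
    # blowing up the variance of the estimate.
    weights = np.minimum(new_prop / logged_prop, clip_m)
    samples = losses * weights                         # per-example loss estimates
    risk = samples.mean()                              # importance-sampling risk estimate
    penalty = lam * np.sqrt(samples.var(ddof=1) / n)   # empirical-variance term
    return risk + penalty                              # conservative bound to minimize

# Toy usage with synthetic logs:
rng = np.random.default_rng(0)
logged_prop = rng.uniform(0.1, 1.0, size=1000)
new_prop = rng.uniform(0.1, 1.0, size=1000)
losses = -rng.integers(0, 2, size=1000).astype(float)  # e.g., -1 for a click, 0 otherwise
print(crm_objective(new_prop, logged_prop, losses))
```

Minimizing this objective over a parametric policy class, as POEM does for exponential models, prefers hypotheses whose estimated risk is not only low but also low-variance, which is the "tightest conservative bound" idea described above.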
Similar Papers
Batch learning from logged bandit feedback through counterfactual risk minimization
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfa...
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfa...
Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation
The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback on translations that were predicted by another, historic SMT system. A challenge arises from the fact that risk-averse commercial SMT systems deterministically log the most probable translation. The lack of sufficient exploration of the SMT...
Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization
Recently, there has been considerable progress towards algorithms for approximating Nash equilibrium strategies in extensive games. One such algorithm, Counterfactual Regret Minimization (CFR), has proven to be effective in two-player zero-sum poker domains. While the basic algorithm is iterative and performs a full game traversal on each iteration, sampling-based approaches are possible. For i...
The Self-Normalized Estimator for Counterfactual Learning
This paper identifies a severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator that avoids this problem. In the BLBF setting, the learner does not receive full-information feedback as in supervised learning, but observes feedback only for the actions taken by a historical policy...
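As a rough sketch of the self-normalized alternative this abstract alludes to, the snippet below assumes precomputed importance weights $w_i = h(y_i \mid x_i) / \pi_0(y_i \mid x_i)$ and logged losses $\delta_i$ as NumPy arrays; the function name is hypothetical and the code is illustrative, not the paper's implementation.

```python
import numpy as np

def snips_estimate(weights, losses):
    # Self-normalized importance sampling: normalize by the sum of the
    # importance weights instead of by n. Rescaling all of the new
    # policy's propensities by a constant then cancels in the ratio,
    # so the estimator no longer rewards policies that merely pile
    # probability mass onto the logged actions.
    return np.sum(weights * losses) / np.sum(weights)
```

This invariance is what removes the degenerate incentive that the vanilla (unnormalized) counterfactual risk estimator gives the learner in the BLBF setting.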